Couch stats resource tracker v2 #5213
base: main
Are you sure you want to change the base?
Conversation
GroupBy = fun couch_stats_resource_tracker:group_by/2,
SortedBy1 = fun couch_stats_resource_tracker:sorted_by/1,
SortedBy2 = fun couch_stats_resource_tracker:sorted_by/2,
ConvertEle = fun(K) -> list_to_existing_atom(binary_to_list(K)) end,
end
end,
{Resp, _Bad} = rpc:multicall(erlang, apply, [
erpc might be a better option https://www.erlang.org/doc/apps/kernel/erpc.html#multicall/3
handle_resource_status_req(#httpd{method = 'POST'} = Req) ->
    ok = chttpd:verify_is_server_admin(Req),
    chttpd:validate_ctype(Req, "application/json"),
    {Props} = chttpd:json_body_obj(Req),
    Action = proplists:get_value(<<"action">>, Props),
    Key = proplists:get_value(<<"key">>, Props),
    Val = proplists:get_value(<<"val">>, Props),
Since we're introducing a new API, add some docs or a description (with examples) of how it works and the intent behind it. What would CouchDB users use it for, and how is it different from metrics, active tasks, etc.?
end,
[]
]),
%%io:format("{CSRT}***** GOT RESP: ~p~n", [Resp]),
Left-over debug comment
end,
[]
]),
%% TODO: incorporate Bad responses
consider using erpc, the handling of bad responses is a bit cleaner there, I think
ok = chttpd:verify_is_server_admin(Req),
{Resp, Bad} = rpc:multicall(erlang, apply, [
    fun() ->
        {node(), couch_stats_resource_tracker:active()}
Instead of having to run in a closure just to get node() in the response, consider using an explicit list of nodes [node() | nodes()] with a plain [M, F, A], and then zipping over the nodes in the response. It's a bit longer but it's less fragile.
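A minimal sketch of that suggestion, combined with the erpc idea from above. This assumes couch_stats_resource_tracker:active/0 as used in this diff; erpc:multicall/4 returns results in the same order as the node list, so the zip is safe:

```erlang
%% Sketch: call with an explicit node list and a plain M/F/A, then pair each
%% node with its reply instead of closing over node() on the remote side.
Nodes = [node() | nodes()],
Resps = erpc:multicall(Nodes, couch_stats_resource_tracker, active, []),
%% Each element is {ok, Result} or {Class, Reason} (error/exit/throw),
%% in the same order as Nodes.
lists:zip(Nodes, Resps).
```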
{[couchdb, query_server, js_filter], [
    {type, counter},
    {desc, <<"number of JS filter invocations">>}
]}.
There is already a .query_server.calls.ddoc_filter counter
{[couchdb, btree, get_node, kp_node], [
    {type, counter},
    {desc, <<"number of couch btree kp_nodes read">>}
]}.
{[couchdb, btree, get_node, kv_node], [
    {type, counter},
    {desc, <<"number of couch btree kv_nodes read">>}
]}.
Wonder if this is a bit too low level? Or rather, btrees are used for both view and db files, so it's not clear which btrees the metrics refer to.
%% TODO: wire in csrt tracking
couch_stats:increment_counter([couchdb, query_server, js_filter_error]),
We already have a general counter for os process errors, normal exits and error exits. Probably don't need an explicit one just for filters?
%% Only potentially track positive increments to counters
-spec maybe_track_local_counter(any(), any()) -> ok.
maybe_track_local_counter(Name, Val) when is_integer(Val) andalso Val > 0 ->
    %%io:format("maybe_track_local[~p]: ~p~n", [Val, Name]),
debug left-over?
%% If after, only currently tracked metrics declared in the app's
%% stats_description.cfg will be trackable locally. Pros/cons.
%io:format("NOTIFY_EXISTING_METRIC: ~p || ~p || ~p~n", [Name, Op, Type]),
ok = maybe_track_local_counter(Name, Value),
If this is enabled for all requests, a bit worried about the performance here as we add an ets operation over a simple integer counter bump. Was there any noticeable performance impact from it, during perf runs?
%% Should maybe_track_local happen before or after notify?
%% If after, only currently tracked metrics declared in the app's
%% stats_description.cfg will be trackable locally. Pros/cons.
%io:format("NOTIFY_EXISTING_METRIC: ~p || ~p || ~p~n", [Name, Op, Type]),
Left-over debug statement
-export([
    get_pid_ref/0,
    set_pid_ref/1,
    create_pid_ref/0,
    close_pid_ref/0, close_pid_ref/1
]).
Consider just doing one per-line for consistency with most of the code-base.
end.

%% monotonic time now in milliseconds
tnow() ->
Minor nit: consider adding units to the function name, e.g. now_msec or something like that.
active() -> active_int(all).
active_coordinators() -> active_int(coordinators).
active_workers() -> active_int(workers).
Avoid using one-line functions. It's shorter, but it's unusual for our code base. They may work for a larger table of values sometimes but for these cases we put the body on a new line.
inc(?DB_OPEN, N) ->
    update_counter(#rctx.?DB_OPEN, N);
inc(?ROWS_READ, N) ->
    update_counter(#rctx.?ROWS_READ, N);
inc(?FRPC_CHANGES_RETURNED, N) ->
    update_counter(#rctx.?FRPC_CHANGES_RETURNED, N);
inc(?IOQ_CALLS, N) ->
    update_counter(#rctx.?IOQ_CALLS, N);
inc(?COUCH_JS_FILTER, N) ->
    update_counter(#rctx.?COUCH_JS_FILTER, N);
inc(?COUCH_JS_FILTER_ERROR, N) ->
    update_counter(#rctx.?COUCH_JS_FILTER_ERROR, N);
inc(?COUCH_JS_FILTERED_DOCS, N) ->
    update_counter(#rctx.?COUCH_JS_FILTERED_DOCS, N);
inc(?MANGO_EVAL_MATCH, N) ->
    update_counter(#rctx.?MANGO_EVAL_MATCH, N);
inc(?DB_OPEN_DOC, N) ->
    update_counter(#rctx.?DB_OPEN_DOC, N);
inc(?FRPC_CHANGES_ROW, N) ->
    update_counter(#rctx.?ROWS_READ, N); %% TODO: rework double use of rows_read
inc(?COUCH_BT_GET_KP_NODE, N) ->
    update_counter(#rctx.?COUCH_BT_GET_KP_NODE, N);
inc(?COUCH_BT_GET_KV_NODE, N) ->
    update_counter(#rctx.?COUCH_BT_GET_KV_NODE, N);
Seeing that we have to do this both for incrementing and for to_json, wonder if it's not simpler to just keep the stats in a map with the stat names as keys.
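A rough sketch of that map-based alternative (the shape here is hypothetical; the PR actually uses an #rctx{} record):

```erlang
%% Keep the per-context counters in a map keyed by stat name, so inc/3 and
%% to_json/1 need no per-stat clause. Unknown or non-positive updates are
%% ignored, mirroring the guard in maybe_track_local_counter/2.
inc(Name, N, #{counters := Counters} = Rctx) when is_integer(N), N > 0 ->
    case is_map_key(Name, Counters) of
        true ->
            Rctx#{counters := maps:update_with(Name, fun(V) -> V + N end, Counters)};
        false ->
            Rctx
    end;
inc(_Name, _N, Rctx) ->
    Rctx.

to_json(#{counters := Counters}) ->
    {maps:to_list(Counters)}.
```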
%% io:format("~n**********MISSING STARTING DELTA************~n~n", []),
couch_stats:increment_counter(
    [couchdb, csrt, delta_missing_t0]),
%%[couch_stats_resource_tracker, delta_missing_t0]),
Commented debug line?
handle_call({call_search, _}, _From, St) ->
    %% TODO: provide isolated search queries here
    {reply, ok, St};
This doesn't do anything currently
code_change(_OldVsn, St, _Extra) ->
    {ok, St}.
This is optional; we don't need to define it.
handle_info(Msg, St) ->
    {stop, {unknown_info, Msg}, St}.
I think this is optional as well, may have to double-check
terminate(_Reason, _St) ->
    ok.
This is optional
named_table,
public,
{decentralized_counters, true},
{write_concurrency, true},
Consider using auto here.
catch evict(PidRef),
demonitor(MonRef),
ok;
{'DOWN', MonRef, _Type, _0DPid, _Reason0} ->
_0DPid looks a bit odd, just like _Reason0. Can just use Pid to match exactly, or _Pid, or just _.
%% TODO: do we need cleanup here?
log_process_lifetime_report(PidRef),
catch evict(PidRef),
demonitor(MonRef),
If we're stopping probably not worth bothering demonitoring
%% TODO: decide on which naming scheme:
%% {[fabric_rpc, get_all_security, spawned], [
%% {[fabric_rpc, spawned, get_all_security], [
{[fabric_rpc, get_all_security, spawned], [
For non-streaming requests the spawned suffix seems redundant. For instance, we don't have get_all_security spawns, then row sends. Could we just have them all as fabric_rpc.$call?
_ ->
    %% Got a message from an old Ref that timed out, try again
    await_shard_response(Ref, Name, Rest, Opts, Factor, Timeout)
This changes the logic a bit to drop any messages. This doesn't seem to be related to the stats tracking, but it's a good change and might be better to do as a separate PR to discuss there.
This goes back to the effort some years back to use a {rexi, ...} prefix for rexi messages to avoid dropping or consuming any 2-item tuple message. Wonder if we should finish that work first and incorporate message cleanup with it. But overall it does seem like a problem for a different PR.
Other ->
    ?LOG_UNEXPECTED_MSG(Other),
    wait_message(Node, Ref, Mon, Timeout)
This changes the logic here by dropping unknown (stale?) messages. See a similar comment for get_shard. It might be good to do that in a separate PR.
@@ -35,6 +35,7 @@ start(Procs) ->
%% messages from our mailbox.
-spec stop(pid()) -> ok.
stop(MonitoringPid) ->
    unlink(MonitoringPid),
Wonder why we have to do this? Was there a crash that happened with stats tracking? We're exiting the monitor pid anyway. If this is not related to the stats tracking might be a good one for a separate PR.
after
    couch_stats_resource_tracker:destroy_context()
If we're handling all the exit cases, do we need an extra after clause?
@@ -60,6 +62,16 @@ process_mailbox(RefList, Keypos, Fun, Acc0, TimeoutRef, PerMsgTO) ->

process_message(RefList, Keypos, Fun, Acc0, TimeoutRef, PerMsgTO) ->
    receive
        Msg ->
This changes the rexi message receive pattern by accepting all messages, while previously we left unknown ones in the mailbox. We may crash matching Payload if we get an unknown message, while previously we left it in the mailbox.
It might be better if we explicitly receive only the messages we expect and not all of them. That was actually the impetus for the {rexi, ...} message patterns. There were bugs related to rexi utils receiving 2-item tuples and ignoring them. That effort only went half-way by adding clauses to the receive here but nowhere else. Perhaps we can resurrect it and enhance those {rexi, ...} messages to take an extra map with options where delta or other context can go into?
add_delta({A}, Delta) -> {A, Delta};
add_delta({A, B}, Delta) -> {A, B, Delta};
add_delta({A, B, C}, Delta) -> {A, B, C, Delta};
add_delta({A, B, C, D}, Delta) -> {A, B, C, D, Delta};
add_delta({A, B, C, D, E}, Delta) -> {A, B, C, D, E, Delta};
add_delta({A, B, C, D, E, F}, Delta) -> {A, B, C, D, E, F, Delta};
add_delta({A, B, C, D, E, F, G}, Delta) -> {A, B, C, D, E, F, G, Delta};
At first sight this looks a bit awkward. Do we have any add_delta({A, B, C, D, E, F}, Delta) -> {A, B, C, D, E, F, Delta}; messages sent to rexi?
It might be better to be explicit about what messages rexi will actually get. We may have to do a two-stage PR: one to prep the receivers to accept both old and new patterns, then another one to start sending the new messages.
select_by_type(coordinators) ->
    ets:select(?MODULE, ets:fun2ms(fun(#rctx{type = {coordinator, _, _}} = R) -> R end));
This might be hard to maintain, because we might forget to update the tuple shape when we change it in other places. I suggest defining a record. The data representation would be the same; however, record syntax is easier to read and maintain.
-record(worker, {
    module :: atom() | '_',
    function :: atom() | '_'
}).
-record(coordinator, {
    module :: atom() | '_',
    function :: atom() | '_'
}).
select_by_type(coordinators) ->
    ets:select(?MODULE, ets:fun2ms(fun(#rctx{type = #coordinator{}} = R) -> R end));
select_by_type(workers) ->
    ets:select(?MODULE, ets:fun2ms(fun(#rctx{type = #worker{}} = R) -> R end)).
OR
coordinators() ->
    ets:fun2ms(fun(#rctx{type = #coordinator{}} = R) -> R end).
workers() ->
    ets:fun2ms(fun(#rctx{type = #worker{}} = R) -> R end).
select_by_type(coordinators) ->
    ets:select(?MODULE, coordinators());
select_by_type(workers) ->
    ets:select(?MODULE, workers()).
create_worker_context(From, {M,F,_A} = MFA, Nonce) ->
    case is_enabled() of
        true ->
            create_context(MFA, {worker, M, F}, null, From, Nonce);
If a worker record were defined (as suggested in another comment) this could be written as:
create_context(MFA, #worker{module = M, function = F}, null, From, Nonce);
This is more verbose. One option to make it less verbose is to define a constructor:
worker({M, F, _A}) ->
    #worker{module = M, function = F}.
create_worker_context(From, {M,F,_A} = MFA, Nonce) ->
    ....
    create_context(MFA, worker(MFA), null, From, Nonce);
    ....
is_logging_enabled() ->
    logging_enabled() =/= false.

logging_enabled() ->
This can cause confusion. The _enabled suffix reads like a boolean; however, this function returns three different values. I am already confused: when I reviewed the function above this one it looked fine, but now I need to go back and think more.
is_logging_enabled() ->
    logging_enabled() =/= false.
What is the intent ^^^? Do we want to enable logging regardless of it being coordinator or true?
In such cases named variants are better:
tracker_subject() ->
    case conf_get("log_pid_usage_report", "coordinator") of
        "coordinator" ->
            coordinator;
        "true" ->
            all;
        _ ->
            no_tracking
    end.
ok
end.

is_logging_enabled() ->
Please add a comment here that is_logging_enabled() is supposed to be true for both true and coordinator.
@@ -222,7 +222,11 @@ stream_ack(Client) ->
%%
ping() ->
    {Caller, _} = get(rexi_from),
    erlang:send(Caller, {rexi, '$rexi_ping'}).
%% It is essential ping/0 includes deltas as otherwise long running
Thank you for the comment.
@@ -53,6 +53,8 @@
-define(INTERACTIVE_EDIT, interactive_edit).
-define(REPLICATED_CHANGES, replicated_changes).

-define(LOG_UNEXPECTED_MSG(Msg), couch_log:warning("[~p:~p:~p/~p]{~p[~p]} Unexpected message: ~w", [?MODULE, ?LINE, ?FUNCTION_NAME, ?FUNCTION_ARITY, self(), element(2, process_info(self(), message_queue_len)), Msg])).
I am surprised linter doesn't complain about such long line.
Couch Stats Resource Tracker
This is a rework of PR: #4812
Overview and Motivation
Couch Stats Resource Tracker (CSRT) is a new engine for tracking the amount of
resource operations induced by processes within CouchDB's Erlang VM. This PR
specifically targets coordinator processes and RPC workers induced by said
coordinator processes, but the underlying stats collection framework is designed
to be usable by anything consuming resources in CouchDB, such that we can extend
this out to background jobs like indexing, compaction, and replication. The long
term stretch goal is to be able to account for all system level activity induced
by CouchDB, but the practical goal is to be able to understand where and why
most of the resources in the system are being utilized, at a request/job level
granularity.
This PR is primarily motivated by the current lack of visibility into
identifying what operations are inducing large quantities of work. We have
node/cluster level visibility into the amounts of IO operations being induced,
but we lack the ability to identify what request/operation induced that
workload.
This is especially problematic when dealing with large fold operations that do local filtering (eg filtered _changes feeds or filtered _find queries), because these operations lack the normal coordinator induced rate limiting that results naturally from funneling individual results back to the coordinator node to sort and process. In the local filtering case, we essentially do a direct fold over the shard and invoke a filter function on each doc to find a matching result, but in the event the docs fail to match this is essentially a local tight loop over the shard that, when run in parallel, can easily dominate IO operations. The _find logic has been extended to generate reports to log and identify the heavy hitter requests, especially the degenerative _find queries that do a full database scan and find zero results to return.
Approach in this PR and differences from mango reports
This PR takes the idea of the mango reports and creates a unified framework for tracking these statistics in real time, allowing for global querying of node and cluster level resource usage in the live processes. This PR reflects an opinionated deviation from the approach in the mango reports, and instead of introducing new stats tracking, it proposes the core approach of:
So instead of embedding new stats like in the mango find reports, this system hooks into the couch_stats:increment_counter logic to piggy back off of the stats being collected in real time by the process doing the work, and then funnels those updates into an ets table keyed off of the local process, and joined at a cluster level by the coordinator ref, allowing for cluster level aggregation of individual http requests, live. These tracked stats are forwarded back to the coordinator process by way of embedding them in the rexi RPC messages, such that long running find queries and other heavy weight processes can be identified and tracked.
We then log a report detailing the total resources induced by the http request
so we can retroactively identify which requests are consuming the most
resources. The reporting by default is configured to only happen at the
coordinator level, but if you're able to handle the log volume it can be enabled
for all workers too. Down the road a nice feature would be supporting writing
reports directly to ClickHouse, some binary format, or even just a terse text
format to allow for increased report volumes; currently high throughput report
generation for coordinator and all rpc workers on high Q databases is
substantial in data volume, but there's much room for optimization given the
verbose nature of the current reports and how well they gzip up.
New metrics introduced
As a result of the above mentioned philosophy of properly tracking the stats
worth tracking, this PR introduces a handful of new stats, predominantly in one
of two forms, listed below. I've also included sample screenshots of the new
metrics plotted during a 30 minute benchmark run that started with an empty
cluster and aggressively created new databases/indexes/documents while growing
worker count progressively during that run. All http traffic was ceased after 30
minutes, and you can clearly see the phase change in operations when that
happened.
eg new stats for counting couch_btree reads and writes on kp/kv nodes. The idea here is that we should be tracking 1-3 things for all induced RPC work:
Item 1) is the primary item for core RPC work, this allows us to see the volume
of RPC workers spawned over time per node. Items 2) and 3) are specific to
aggregate operations, with 3) specific to aggregate operations that can perform
local filtering.
The idea is that we can a) see the types of work being induced on nodes over time and observe how many documents are being processed by the aggregate worker operations, and then b) directly observe major discrepancies between docs processed and docs returned, as that's indicative of a missing index or poorly designed workflows.
Here's a full cluster view of all nodes rpc traffic:
In the case of our benchmark above, the workload was evenly distributed so all
nodes performed similarly. This is a lot of data, but can easily be aggregated
by node or type to identify non-homogeneous workloads. Here's a simpler view
showing per node RPC workloads:
Tracking table for accumulating and querying metrics
The central tracking table is a global ets table utilizing read_concurrency, write_concurrency, and decentralized_counters, which results in an impressively performant global table in which all processes update their local stats. Writes are isolated to the process doing the work, so there is no contention from parallel writes to the same key. Aggregations are performed against the full ets table, but updates to a given key are constrained to the corresponding worker process.
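A sketch of creating such a table with the option set described above (table name here is illustrative, not necessarily the one used in the PR):

```erlang
%% Global tracking table: public so each worker process can update its own
%% row, with write_concurrency + decentralized_counters so concurrent
%% ets:update_counter/3 calls on different keys don't contend.
create_tracking_table() ->
    ets:new(csrt_tracker, [
        named_table,
        public,
        {read_concurrency, true},
        {write_concurrency, true},
        {decentralized_counters, true}
    ]).
```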
Previous design that failed to scale
In the previous PR I attempted to utilize a singular gen_server for monitoring the processes and performing some cleanup operations. This was optimized down to only being a dedicated server doing handle_call({monitor, Pid},..) -> monitor_pid(), {reply, ok, State}). handle_info({DOWN, ..., REF, ...}) -> ets:delete(maps:get(Ref, RefToPid)) and that was insufficient to handle the load. I tried various approaches but I was able to melt a singular gen_server easily. It's necessary to have a process monitor outside of the local process because coordinator/worker processes can and will get killed mid operation, therefore after clause/function based approaches are insufficient.
Even with that minimal of a workload, I was able to melt the gen_server:
and that's with it really doing a minimum workload:
This was my final attempt at a singular gen_server architecture, but with 80 core nodes I'm now fully convinced it's no longer viable to do singular gen_server systems in hot code paths and we must take more distributed approaches, either by way of sharding the servers or fully distributed.
Distributed tracker approach in CSRT v2
In the case of CSRT, I engaged a fully distributed approach that spawns a
dedicated monitor process when a CSRT context is created by a coordinator or
worker. This monitor process handles the lifetime of a given entry in the ets
table so that we delete the worker entry when the worker is done. This dedicated
monitor process also generates the report afterwards. Switching to the dedicated
monitor approach eliminated the scaling issues I encountered, and the current
architecture is able to readily handle max throughput load.
The CSRT context is created in the coordinator process directly in chttpd:process_request_int, and in the worker process directly in the spawned process's initialization of rexi_server:init_p. The context is basically just erlang:put(?CONTEXT_MARKER, {self(), make_ref()}), which is then the ets key used for tracking the coordinator process while it handles the given request.
The make_ref() ensures that the coordinator processes that are reused in the Mochiweb worker pool distinguish between individual http requests. More generally, this allows a long lived process to isolate subsets of its own work. This is essential if we want to add the ability to funnel the CSRT context through IOQ/couch_file/couch_mrview/etc to accumulate
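The context creation described above amounts to roughly the following (the marker name is illustrative; only the erlang:put(?CONTEXT_MARKER, {self(), make_ref()}) shape is taken from the description):

```erlang
-define(CONTEXT_MARKER, csrt_pid_ref).

%% Create the per-request context key: {Pid, Ref} uniquely identifies this
%% unit of work even when the mochiweb acceptor process is reused for
%% subsequent http requests.
create_pid_ref() ->
    PidRef = {self(), make_ref()},
    erlang:put(?CONTEXT_MARKER, PidRef),
    PidRef.

%% Fetch the current context key from the process dictionary (undefined if
%% no context has been created in this process).
get_pid_ref() ->
    erlang:get(?CONTEXT_MARKER).
```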
A note on PidRef vs nonce for identification
Currently we don't funnel the coordinator's PidRef across the wire and instead rely on the nonce as a global aggregator key, and then the coordinator aggregations happen directly when the RPC responses are received and the deltas are extracted. We could add this fairly easily in rexi:cast_ref, but I do wonder if we'd be better off skipping the ref entirely and instead using {self(), Nonce} as the key given we already funnel it around. That won't work for background operations, so we'd need a make_ref() fallback for tracking those jobs, but I do like the idea of consolidating down to using the nonce given it's already the unique reference to the request inducing the workload, and we already send it over the wire ubiquitously for all coordinator/worker operations through rexi:cast_ref.
Context creation and lifecycle
We create the initial context in
chttpd:process_request_int
/rexi_server:init_p
for the coordinator/workers,respectively, and then we progressively fill in the details for things like
dbname/username/handler_fun so that we can track those data points naturally as
they arise in the request codepath, for example adding the chttp_db handler when
entering those functions, or setting the username after
chttpd_auth:authorize
returns. Similarly, in
fabric_rpc
we piggy back off offabric_rpc:set_io_priority
called by every callback function to cleanly setthe dbname involved in the RPC request. We could also extend this to track the
ddoc/index involved, if any.
The idea is to make it easy for the local process to update its global tracked
state at the appropriate points in the codebase so we can iteratively extend out
the tracking throughout the codebase. Background indexing and compaction are
apt targets for wiring in CSRT and we could even extend the
/_active_tasks
jobs to include status about resource usage.
When we initiate the context, for workers, coordinators, or any future job types, we spawn a dedicated tracking monitor process that sits idly by until it gets a stop message from normal lifecycle termination, or it gets a DOWN message from the process doing work. In either case, the tracker process cleans up the corresponding ets table entry (the only form of two processes writing to the same key in the ets tracker, but handed off with no interweaving) and then
conditionally generates a
report
to log the work induced.The default, when CSRT is enabled, is to log a report for the coordinator
process totalling the tracked work induced by the RPC workers to fulfill the
given request. It's configurable to also log workers, and there's some
rudimentary filtering capabilities to allow for logging of only a specific rpc
worker type, but this could be improved upon considerably. In general, the
volume of these logs can be sizable, for example a singular http view request
against a Q=64 database on a healthy cluster induces 196 RPC workers, all
inducing their own volume of work and potentially logging a report. A compact
form or additional filtering capabilities to log interesting reports would be
beneficial.
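A condensed sketch of the dedicated tracker process described above (function and table names are illustrative; the real implementation lives in couch_stats_resource_tracker):

```erlang
%% Spawn a monitor that owns cleanup and reporting for one context entry.
%% It waits for either a normal stop or a DOWN from the tracked process,
%% then deletes the ets row and conditionally logs the lifetime report.
spawn_tracker(PidRef) ->
    Tracked = self(),
    spawn(fun() ->
        MonRef = erlang:monitor(process, Tracked),
        receive
            stop ->
                %% normal lifecycle termination
                log_process_lifetime_report(PidRef),
                ets:delete(csrt_tracker, PidRef);
            {'DOWN', MonRef, process, _Pid, _Reason} ->
                %% tracked process was killed mid operation
                log_process_lifetime_report(PidRef),
                ets:delete(csrt_tracker, PidRef)
        end
    end).
```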
Status and next steps
Overall I'm happy with the performance of the core tracking system; the modern ETS improvements with decentralized counters on top of atomic increments are really impressive! I had the test suite fully passing recently, but I've since done a decent bit of cleanup and restructuring, so I haven't checked a full CI run in a minute; I'll see how it looks on this PR and address anything that comes up. I'd like to start getting a review here and see what folks think; I think the structure of the code is in a good place to discuss and get feedback on. A few next steps to do:
- couch_stats_resource_tracker.erl, at the very least
- time tracking: the tnow() is not a positive_integer()
- let's add a handful of standard performant functions, eg: rexi_server:init_p
- should_increment([M, F, spawned]) vs should_increment([rexi_rpc, M, F, spawned]) vs should_increment([couchdb, rpc_worker, M, F, spawned])
Sample report
Filtered changes http request with ?include_docs=true and JS filter: